ABSTRACT


This thesis develops a model which describes how errors
enter and leave an operating data base for manpower management.
The model describes the error input process and the error
distribution in the data base. The underlying structure for
the model is the M/G/∞ queue. The model is used to determine
the effect of a change in the input error rate on the number
of errors in the data base. An upper limit is determined for
the rate of increase of the mean number of errors, and a method
of determining a minimum time between samples in the worst
possible case is proposed.
I. INTRODUCTION


Decision making processes in the Department of Defense
are not unlike those in other large governmental and
nongovernmental industrial enterprises. During the past fifteen
years, a considerable portion of the logistics, engineering,
and management effort has been computerized. This has resulted
in a considerable number of support and reference ADP
files that constitute the data input for the computer. The
files, or data bases, vary in size from 50,000 up to millions
of records. In terms of alphanumeric characters, some of the
files have from fifty million to ten billion characters. We
will use the operating data base of the Navy's Bureau of Naval
Personnel as an example throughout this thesis, but the model
developed is generally applicable to data bases in the
logistics and engineering areas as well.

The Active Duty Enlisted Master Magnetic Tape Record
(E.M.T.) is the operating data base which this thesis addresses.
It contains 550 systematically arranged alphanumeric characters
for every enlisted man on active duty, approximately 600,000
men. These alphanumeric characters represent such information
as name, rate, serial number, social security number, age,
race, religion, number of dependents, GCT/ARI scores, home of
record, years of formal education, pay entry base date, duty
station, and many others. For a detailed description of the
contents of the data base, see Ref. [4]. This information is
used by manpower managers in the Bureau of Naval Personnel
to facilitate assignments, to fill school quotas, to determine
force parameters, and to make budget and end strength predictions
and many other manpower management decisions.

Inputs are made daily to the Enlisted Master Tape by
every reporting unit in the Navy. This is done in the form
of a unit diary. The diary is the paper that is submitted
daily to an ADP center for editing, coding, and eventual
insertion into the E.M.T. For example, see Fig. 1. Information
flows from the reporting units to the ADP centers to the
change routine which alters the E.M.T. See Ref. [2].

The purpose of this thesis is to develop a model which
describes how errors enter and leave the data base (E.M.T.)
and to investigate how this model can be used to help design
sampling methods similar to those in standard statistical
quality control procedures. It does not address format editing,
which is covered in Ref. [2]. Format editing takes place
at the ADP center and during the change routine. If the format
is not correct for the type of data element being changed, the
computer will not perform the change and so indicates. We
are concerned in this thesis with an after-the-fact evaluation
of the data in the E.M.T. We are concerned, then, with technical
editing. Some examples of technical editing are these:
correct service number, correct pay entry base date, correct
rate, correct time in grade, and correct duty station. The
results of the study will show that errors arrive in a
Poisson fashion and that the number of errors in the data base
has a Poisson distribution, and will show how this distribution
changes with a change in the error input rate. A sampling method
is described and an upper limit for the rate of increase of
errors in the system is determined.

Figure 1. Information Flow from Reporting Units to the
Enlisted Master Tape.

This thesis is written in five sections, of which this is
the first. In Section II we model the input process by which
errors enter the data base. We show that they enter in a
Poisson process. In Section III we develop a model for the
distribution of errors in the data base. Our assumptions lead
to formulating the M/G/∞ queue (the infinite server Poisson
queue). In Section IV we use this model to determine the
effect of a change in the input error rate on the number of
errors in the data base. An upper limit is determined for the
rate of increase of the mean number of errors in the data base.
In Section V we use the results of Section IV to help design
error rate sampling procedures. The relationship between the
size of the sample and the frequency of the sampling procedure
is described. The goal is to design a sampling procedure in
the data base which will allow an early detection of significant
changes in the input error rate.





II. ARRIVAL OF ERRORS INTO THE DATA BASE 


Assume a large number of possible places from which data
can come each day; e.g., 600,000 men, each with 50 data
elements, give $30 \times 10^6$ possible arrivals each day. Only
about 1000 changes occur per day, so the probability that any
given data element changes is about
$1000/(30 \times 10^6) \approx 3 \times 10^{-5}$. A small
proportion of these are in error, so the probability that an
error arrives is even smaller.

For each change to the data base which arrives, we define
a Bernoulli random variable $X_i$. When change i is made, $X_i$
takes on a value of zero if change i is correct and of one if
change i is in error. When k changes are made, the number of
errors which occur is a Binomial random variable if the following
two conditions are satisfied: first, the probability of an
error in any change is independent of other changes; second,
this probability is constant, i.e., the $X_i$'s are independent,
identically distributed random variables.

It seems reasonable to assume that the receipt of an error
from one reporting unit is independent of the receipt of an
error from any other reporting unit. An exception might be a
case in which an incorrect directive is being followed by all
reporting units. This is the type of case which we would like
to discover has occurred; however, for normal operations, we
can assume it is not the case. We have no data to test whether
or not the $X_i$'s are identically distributed. This assumption
implies, for example, that the probability an arriving data
element (such as pay entry base date) is in error is the same
no matter where it came from or when it arrives.

The Poisson approximation to the Binomial is justified in
cases where n is large and p is small. This is precisely the
case under consideration. The number of possible changes, n,
in the data base is very large ($\approx 30 \times 10^6$) and p,
the probability of receipt of an error, is very small. Thus, in
the remainder of the thesis, we assume that errors arrive in
the data base in a Poisson process.
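
The quality of the Poisson approximation can be checked numerically. The following is a minimal sketch, not part of the original analysis: it compares Binomial(n, p) probabilities with Poisson(np) probabilities for 1000 daily changes and an assumed per-change error probability of 0.01. The value 0.01 and the function names are hypothetical choices made only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P{exactly k errors among n changes}, computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, mean):
    """P{exactly k errors} for a Poisson variable with the given mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def max_pmf_gap(n, p, k_max):
    """Largest pointwise difference between the two distributions for k = 0..k_max."""
    mean = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
               for k in range(k_max + 1))

if __name__ == "__main__":
    # Hypothetical values: 1000 changes per day, error probability 0.01 per change.
    print(max_pmf_gap(n=1000, p=0.01, k_max=40))
```

The gap shrinks as p decreases with np held fixed, which is the regime described above.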
III. THE MODEL FOR THE DISTRIBUTION OF ERRORS


Assume that each day a number of errors arrive, this number
being a Poisson random variable. Each one enters the data
base and remains there until, at some future date, this data
point is changed. It is changed by either of the following
events:

a. a correct updated version of the data point
arrives and replaces what is already there, or

b. an incorrect updated version of the data point
arrives and replaces what is already there.

From our Poisson assumptions, we know that the events a.
and b. are independent. Since the probability of replacing a
data point which is currently in error with a new data point
which is also in error is very small, we can assume that almost
all the time, events of type a. are the only ones which remove
errors from the data base.

Thus, the time an error spends in the data base is essentially
independent of the arrival stream of errors.

Therefore the data base acts like the M/G/∞ queue (the
infinite server queue, with Poisson arrivals and a general
service time distribution).

Assume that the errors arrive at the data base at rate $\lambda$,
that each one stays in the data base a random time, and that
service time (length of stay in the data base) is randomly
distributed as G.





Now, using the model of the infinite server Poisson queue,
the problem of solving for the distribution of the number of
errors in the data base is addressed. For the M/G/∞ queue,
with arrivals at rate $\lambda$ and mean service time (average
time a data point spends in the data base) of $1/\mu$, the number
of errors in the system in steady state has a Poisson distribution
with mean $\lambda/\mu$ [Ross Ref. 1, p. 18].

In general, let X(t) denote the number of errors in the
system at time t, where we start with no errors at time 0. We
determine the distribution of X(t) by conditioning on N(t), the
total number of errors which have arrived by time t. By
conditioning, we obtain

\[ P\{X(t)=j\} = \sum_{n=0}^{\infty} P\{X(t)=j \mid N(t)=n\}\, e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}. \tag{1} \]


The probability that an error which arrives at time x will
still be present at time t is $1-G(t-x)$. Hence, given that
$N(t)=n$, it follows that the probability an arbitrary one of
these errors is still present at time t is given by

\[ p = \int_0^t \bigl(1-G(t-x)\bigr)\,\frac{dx}{t} = \int_0^t \bigl(1-G(x)\bigr)\,\frac{dx}{t} \tag{2} \]

independently of the others. This follows since we know that
given $N(t)=n$, the n arrival times $S_1, \ldots, S_n$ have the same
distribution as the order statistics corresponding to n independent
random variables uniformly distributed on the interval
(0,t) [Ross Ref. 1, p. 17].





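Hence,

\[ P\{X(t)=j \mid N(t)=n\} = \binom{n}{j} p^j (1-p)^{n-j}, \qquad j=0,1,2,\ldots,n. \]

Thus, by (1) we have,

\[ P\{X(t)=j\} = \sum_{n=j}^{\infty} \binom{n}{j} p^j (1-p)^{n-j}\, e^{-\lambda t}\,\frac{(\lambda t)^n}{n!} \]
\[ = e^{-\lambda t}\,\frac{(\lambda t p)^j}{j!} \sum_{n=j}^{\infty} \frac{\bigl(\lambda t (1-p)\bigr)^{n-j}}{(n-j)!} \]
\[ = e^{-\lambda t p}\,\frac{(\lambda t p)^j}{j!}, \]

where

\[ p = \int_0^t \bigl(1-G(x)\bigr)\,\frac{dx}{t}. \]

That is, X(t) has a Poisson distribution with mean

\[ \lambda \int_0^t \bigl(1-G(x)\bigr)\,dx. \]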
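
As an illustration of this result, the mean number of errors at time t can be evaluated numerically for any service time distribution G. The sketch below is ours and rests on assumptions not in the thesis: the function name, the trapezoid rule, and the example values (10 errors per day, exponential service with a mean of 300 days) are all hypothetical.

```python
import math

def mean_errors_in_base(lam, survival, t, steps=10_000):
    """Approximate lam * integral_0^t (1 - G(x)) dx with the trapezoid rule.

    lam      -- error arrival rate (errors per unit time)
    survival -- function returning 1 - G(x), the chance an error outlives age x
    t        -- time since the (empty) data base started receiving errors
    """
    h = t / steps
    total = 0.5 * (survival(0.0) + survival(t))
    for i in range(1, steps):
        total += survival(i * h)
    return lam * h * total

if __name__ == "__main__":
    mu = 1.0 / 300.0  # hypothetical: mean stay of 300 days, exponential service
    for t in (30.0, 300.0, 3000.0):
        print(t, mean_errors_in_base(10.0, lambda x: math.exp(-mu * x), t))
```

For exponential service the integral has the closed form $(\lambda/\mu)(1-e^{-\mu t})$, which approaches the steady-state mean $\lambda/\mu$ as t grows and gives a convenient check on the numerical routine.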











IV. CHANGE IN THE INPUT ERROR RATE


For the M/G/∞ queue with arrivals at rate $\lambda$ and mean
service time (average time a data point spends in the data
base) of $1/\mu$, the number in the system (errors in the system)
in steady state has a Poisson distribution with mean $\lambda/\mu$
[Ross Ref. 1, pp. 17, 18, 19].


Assume that for $t < t_1$ the data base errors have been
arriving at a constant rate $\lambda$ for some time and that the
system is in steady state. At $t_1$, the error rate changes to a
new rate $\beta$ which for simplicity we assume is greater than
$\lambda$. Thus in steady state at this new rate, the expected
number of errors in the data base is $\beta/\mu$. We wish to
investigate how fast the expected number of errors reaches this
level. (See Fig. 2.) Let

X(t) = the number of errors in the data base at time t;

Y(t,x) = the number of errors in the data base at
time (t+x) that were in the data base at time t, $x \ge 0$;

and Z(t,x) = the number of errors in the data base at
time (t+x) that arrived during the interval (t,t+x).

Then

\[ X(t+x) = Y(t,x) + Z(t,x). \tag{3} \]


It is reasonable to assume that Y(t,x) and Z(t,x) are
independent random variables. They are dependent to the
extent that an incoming error might replace an error that is
already there. We assume that such an event occurs only very
rarely.

Figure 2. Expected Number of Errors in the Data Base
vs. Time. (Vertical axis: E{No. errors in data base}, with the
level $\beta/\mu$ marked; horizontal axis: time.)

Our problem is to find the distribution of $X(t_1+x)$. To
do this we must find the distribution of $Y(t_1,x)$ and $Z(t_1,x)$.
For $Y(t_1,x)$, we have

\[ P\{Y(t_1,x)=k \mid X(t_1)=n\} = P\{(n-k) \text{ of the errors in the system at time } t_1 \text{ have left by time } (t_1+x)\}. \]

An error which is in the system at time $t_1$ has left
the system by time $(t_1+x)$ if and only if its remaining service
time is no more than x. If $t_1$ is an arbitrarily chosen point,
then the remaining service time will be distributed the same as
an equilibrium excess random variable [Ross Ref. 1, pp. 44-47].
That is, let service time (S) be distributed as G, and
$E(S)=1/\mu$; then the remaining service time at $t_1$ for an
arbitrary error is distributed $G_e(y)$ where

\[ G_e(y) = \mu \int_0^y \bigl(1-G(s)\bigr)\,ds. \tag{4} \]

With this notation, we have

\[ P\{Y(t_1,x)=k \mid X(t_1)=n\} = \binom{n}{k}\,\bar{G}_e(x)^k\, G_e(x)^{n-k} \tag{5} \]

where $\bar{G}_e(x) = 1-G_e(x)$.


We also know that [see Ross Ref. 1, pp. 18, 19]

\[ P\{X(t_1)=n\} = e^{-\lambda/\mu}\,\frac{(\lambda/\mu)^n}{n!}, \qquad n=0,1,2,\ldots \tag{6} \]





By conditioning on $X(t_1)$ we have that





The inequality (10), when used in (9), shows that the
true mean value function is bounded above by

\[ E\bigl(X(t_1+x)\bigr) \le \lambda/\mu + (\beta-\lambda)x, \qquad \text{if } x \le 1/\mu, \]
\[ E\bigl(X(t_1+x)\bigr) \le \beta/\mu, \qquad \text{if } x \ge 1/\mu. \]

This is shown in Fig. 3.

Figure 3 is useful in that it shows the maximum rate at
which the mean number of errors in the data base adjusts to
its new equilibrium value when the error input rate changes.
This will give us an idea of how frequently to sample the
data base to see if error rates are changing. This is
discussed in Section V.

Figure 3. Upper Bound on the Mean of a Changing Input Error
Rate. (Vertical axis: E{No. errors in data base}, marked at
$\lambda/\mu$ and at $E[X(t_1+1/\mu)]$; horizontal axis: time,
marked at $t_1$ and $t_1+1/\mu$.)
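
To see how conservative the upper bound can be, it may be compared with the exact mean in one special case. The sketch below assumes exponentially distributed service times, which is our assumption and not one made in the thesis; in that case the memoryless property gives the mean x time units after the change as $\lambda/\mu + ((\beta-\lambda)/\mu)(1-e^{-\mu x})$. The function names and numerical values are hypothetical.

```python
import math

def upper_bound_mean(lam, beta, mu, x):
    """Piecewise-linear upper bound on E[X(t1 + x)] given in this section."""
    if x <= 1.0 / mu:
        return lam / mu + (beta - lam) * x
    return beta / mu

def exact_mean_exponential(lam, beta, mu, x):
    """E[X(t1 + x)] under the ASSUMPTION of exponential service with rate mu."""
    return lam / mu + (beta - lam) / mu * (1.0 - math.exp(-mu * x))

if __name__ == "__main__":
    # Hypothetical rates: 10 errors/day rising to 15 errors/day, mean stay 300 days.
    lam, beta, mu = 10.0, 15.0, 1.0 / 300.0
    for x in (0.0, 100.0, 300.0, 900.0):
        print(x, exact_mean_exponential(lam, beta, mu, x),
              upper_bound_mean(lam, beta, mu, x))
```

In this special case the exact curve stays below the bound everywhere and the two agree at x = 0, consistent with the bound describing the fastest possible adjustment.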
V. SAMPLING PROCEDURE AND STATISTICAL QUALITY CONTROL


Quality control of, and sampling from, a population which
has numerous input streams all subject to human error can be
approached using standard quality control procedures if the
following assumption is made: many separate reporting units
following the same set of instructions with regard to diary
and service record entries act as a single system.

This assumption is the basis for the quality control
procedures now being used in a naval manpower data base, the
U.S. Marine Corps Manpower Management System (see Ref. [2]).
The method used is as follows: In order to determine the
fraction of errors in the data base, a sample of 2500 (out of
about 200,000) service records is randomly selected and compared
with the source documents. The sampled data is sent, via U.S.
mail, to the individual's reporting unit, under a cover letter
asking for match/mismatch information between the information
in the data base and the correct records at his reporting unit.
Verification of the data is limited to the following: match,
mismatch, or "can't find." The first indicates no error. The
second indicates an error. The third indicates a case which
arises when an individual is in transit between duty stations
or not at his last known command. This last case could occur
for many reasons, leave and temporary duty assignments elsewhere
being the most likely. These cases are removed from the
sample, and the fractional error rate for a given type of data
element is found by dividing the total mismatches by the sum
of the mismatches and matches.

When the match/mismatch information is returned, the mean
and variance of the fraction of errors for each element is
determined as follows. Any process which generates output
that can be characterized as either "correct" or "incorrect,"
for which each generating event (trial) is independent in
the sense that it is not influenced by prior events and does
not influence subsequent events, and which can be described by
a single parameter giving the probability of correct (or
incorrect) events, is called a Bernoulli process. The probability
of exactly c correct (or incorrect) events in n trials of such
a process for the parameter p is given by the binomial
distribution. For purposes of this study, the finite population of
elements in the data base is so large that we have assumed the
population to be infinite. This assumption is common in cases
of acceptance sampling [Fetter Ref. 3, Chapter 1].

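Let the sample size be n, and let the number of errors be
a random variable X, where X is distributed Binomial (n,p).
Then

\[ \hat{p} = \frac{X}{n}, \quad \text{so that} \quad E(\hat{p}) = E\!\left(\frac{X}{n}\right) = \frac{1}{n}E(X) = \frac{1}{n}(np) = p. \]

Also,

\[ \mathrm{Var}(\hat{p}) = \mathrm{Var}\!\left(\frac{X}{n}\right) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{1}{n^2}\,npq = \frac{pq}{n}, \]

where $q = 1-p$. Now, we know that $E(X) = np$ and $\mathrm{Var}(X) = npq$, thus

\[ \sigma_X = \sqrt{npq}, \qquad \sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} = \frac{\sigma_X}{n}. \tag{11} \]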


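To estimate the variance of X from our single sample, define
$\sigma_X^2 = \mathrm{Var}(X)$. Then let

\[ \hat{\sigma}_X^2 = n\left(\frac{X}{n}\right)\left(1-\frac{X}{n}\right) = X - \frac{X^2}{n}. \tag{12} \]

Now we determine if this estimate is biased, and we find that

\[ E[\hat{\sigma}_X^2] = E\!\left[X - \frac{X^2}{n}\right] = E[X] - \frac{1}{n}E[X^2] \]
\[ = np - \frac{1}{n}\left[np(1-p) + n^2p^2\right] = (n-1)p(1-p), \]

since

\[ E[X^2] = \mathrm{Var}[X] + (E[X])^2 = npq + n^2p^2. \]

Thus we see that the above definition of $\hat{\sigma}_X^2$ is a biased
estimate of $\sigma_X^2$. We can eliminate the bias by introducing the
factor $n/(n-1)$ into Eq. (12). Thus, we redefine

\[ \hat{\sigma}_X^2 = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right). \tag{13} \]

It can be seen from the following calculations that this
is an unbiased estimate of the variance.

\[ E[\hat{\sigma}_X^2] = \frac{n}{n-1}\left(E[X] - \frac{1}{n}E[X^2]\right) \]
\[ = \frac{n}{n-1}\left[np - \frac{1}{n}\left(np(1-p) + n^2p^2\right)\right] \]
\[ = \frac{n}{n-1}\left[(n-1)p(1-p)\right] = npq. \]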
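
The bias calculation can be confirmed by summing each estimator against the binomial probabilities directly. This is a minimal sketch for illustration only; the function names and the small test case (n = 50, p = .1) are hypothetical and are not values used in the thesis.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k errors in the sample."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def expected_value(estimator, n, p):
    """Exact expectation of estimator(X, n) when X is Binomial(n, p)."""
    return sum(estimator(x, n) * binom_pmf(x, n, p) for x in range(n + 1))

def biased_estimate(x, n):
    """Eq. (12): X(1 - X/n)."""
    return x * (1 - x / n)

def unbiased_estimate(x, n):
    """Eq. (13): the same quantity scaled by n/(n-1)."""
    return n / (n - 1) * x * (1 - x / n)

if __name__ == "__main__":
    n, p = 50, 0.1  # hypothetical small case so the sum is cheap
    print(expected_value(biased_estimate, n, p), (n - 1) * p * (1 - p))
    print(expected_value(unbiased_estimate, n, p), n * p * (1 - p))
```

The first line of output should show agreement with $(n-1)pq$ and the second with $npq$, matching the two expectations derived above.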
With this estimate of the variance, the width of the 95%
confidence interval can be determined as follows: for the
sample of 2500 let X = 250. Then

\[ \hat{p} = \frac{X}{n} = \frac{250}{2500} = .1. \]

Then by (13),

\[ \hat{\sigma}_X^2 = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right) = \frac{2500}{2499}(250)(.9) = 225.1. \]

Thus

\[ \hat{\sigma}_X = 15.0 \quad \text{and} \quad \hat{\sigma}_{\hat{p}} = \frac{\hat{\sigma}_X}{n} = \frac{15.0}{2500} = .006. \]


With this estimate of the standard deviation of the fraction
of error (Fig. 4), we see that the 95% confidence interval is
of width 2 x (1.96) x (.006) = .024, since for large n,
binomial probabilities may be approximated by the normal
distribution. The values of the parameters determined from this
sample are taken to be the population parameters. When considering
samples of size 200,000 (the whole population) and of size 200,
with reference to the sample of 2500, plotted in Fig. 4, it can be
seen that the width of the confidence interval decreases as the
sample size increases (for samples small relative to the
population, in inverse proportion to the square root of the
sample size). For a sample which is equal
to the whole population, the fraction of errors discovered
would not be an estimate, but would in fact be the true fraction
of errors. The width of the confidence interval would be
zero, as shown by the line at .1 in Fig. 4.


For a sample of size 200 with X = 20, by Eq. (13) we would have:

\[ \hat{\sigma}_X^2 = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right) = \frac{200}{199}(20)(.9) = 18.1; \]

thus

\[ \hat{\sigma}_X = 4.25 \quad \text{and} \quad \hat{\sigma}_{\hat{p}} = \frac{4.25}{200} = .0213. \]

Then the 95% confidence interval is of width 2 x (1.96) x
(.0213) = .083.

Figure 4. Comparison of Confidence Interval Width for Three
Different Sample Sizes.

Sample size can be determined using the above procedure if
we know the confidence level and interval width we desire and
the population parameter (p). For example, if p=.1, and the
half width of the interval, say a, = .02, and if we desire a
95% level of confidence in this interval, then we know that

\[ 1.96\,\sigma_{\hat{p}} = .02, \qquad \sigma_{\hat{p}} = \frac{.02}{1.96} \approx .01; \]

then by (11) we have that

\[ \frac{\sigma_X}{n} = \sigma_{\hat{p}} \quad \text{or} \quad n = \frac{\sigma_X}{\sigma_{\hat{p}}}, \]

thus

\[ n = \frac{\sqrt{npq}}{\sigma_{\hat{p}}}, \]

thus

\[ n^2 = \frac{npq}{\sigma_{\hat{p}}^2}, \qquad n = \frac{pq}{\sigma_{\hat{p}}^2} = \frac{(.1)(.9)}{.0001} = 900. \]


See Figs. 5 and 6 for a plot of sample sizes vs. width of the
confidence interval for p=.1 and .05 respectively.
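
The relation between sample size and interval width can be tabulated with a short routine. The sketch below is illustrative only: it uses the large-sample form $\sigma_{\hat{p}} = \sqrt{pq/n}$ from Eq. (11), and the function names are our own. Because it does not round the standard deviation, it gives a somewhat smaller sample size than a hand calculation that rounds $\sigma_{\hat{p}}$ to .01.

```python
import math

def interval_width(p, n, z=1.96):
    """Full width of the normal-approximation confidence interval for p-hat."""
    return 2.0 * z * math.sqrt(p * (1.0 - p) / n)

def sample_size(p, half_width, z=1.96):
    """Smallest n whose interval half width does not exceed half_width."""
    return math.ceil(z**2 * p * (1.0 - p) / half_width**2)

if __name__ == "__main__":
    # Points of the kind plotted in Figs. 5 and 6 (p = .1 and p = .05).
    for p in (0.1, 0.05):
        for a in (0.01, 0.02, 0.04):
            print(p, a, sample_size(p, a))
```

Halving the desired half width roughly quadruples the required sample, which is the trade-off the two figures display.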

Having solved for the mean and variance of the fraction
of errors in a sample, and having determined the sample size
which must be used in order to have a desired confidence
interval, we are now able to address the question of frequency
of sampling in terms of the minimum time between samples.

In order to be able to control the error rate, we must
be able to predict its behavior. By the methods just described,
we are able to determine the limits within which the error
rate from a sample should lie. By assuming that the same set
of causal factors will continue to operate in the future, it
is usually possible to make a prediction of the expected
behavior of the system. Then, if a change occurs in the causal
system which changes the error rate, this fact should be
quickly apparent through an increase in the error rate in the
samples. We then attempt to discover the cause of the increase
and eliminate it. The time between samples becomes an important
parameter when early detection of changes in error rate is
desired. See the quality control chart in Fig. 7 [Fetter Ref.
3, Chapters 1 through 3].

Figure 5. Plot of Sample Size vs. Confidence Interval
Width for p=.1.

Figure 6. Plot of Sample Size vs. Confidence Interval Width for p=.05.

As long as the fraction of errors stays within the upper
and lower confidence bounds, we say that it is "in control."
When it leaves this interval on the upper side, we say it is
"out of control." If we were to pick limits of $\pm 1.96\sigma$ for
our confidence interval, we would have a 95% confidence interval.
That is, only five times out of a hundred would we think the
process was "out of control" when it was actually "in control."
If we were to increase the limits of our confidence interval to
$\pm 3\sigma$, we would then have a 99+% confidence interval; that is,
only three times out of a thousand would we mistakenly infer that
the system was "out of control." The wider the limits, the greater
confidence we have in our interval. On the other hand, wider
limits decrease the probability that a change in the process
will be detected quickly.

Figure 7. Typical Quality Control Chart.





Figure 8 shows changes in the average error from $\bar{x}_1$ to
$\bar{x}_2$ with their corresponding normal densities. The shaded area
represents the only place on the graph of the new density
where an error rate will be determined to have changed, since
any other place in the new density is still within the limits
of the old density. The shaded area then is the probability
that the shift will be detected by any sample taken following
the change.
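
The shaded area can be computed under the normal approximation used throughout this section. The following sketch is ours, not the thesis's: it assumes an upper control limit of 1.96 standard deviations about the old fraction of errors and treats the sample fraction after the shift as normal with the new mean and variance. The function names and example values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def detection_probability(p_old, p_new, n, z=1.96):
    """Chance that one sample of size n, drawn after the error fraction has
    shifted to p_new, falls above the old upper control limit."""
    upper_limit = p_old + z * math.sqrt(p_old * (1.0 - p_old) / n)
    sigma_new = math.sqrt(p_new * (1.0 - p_new) / n)
    return 1.0 - normal_cdf((upper_limit - p_new) / sigma_new)

if __name__ == "__main__":
    # Hypothetical shifts away from an old error fraction of .1, sample of 1000.
    for p_new in (0.10, 0.11, 0.12, 0.15):
        print(p_new, detection_probability(0.10, p_new, 1000))
```

With no shift at all the function returns about .025, the upper-tail false alarm rate of the 95% limits; the probability rises toward one as the shift grows relative to the sample standard deviation.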

Recall Fig. 3 in Section IV for a change in input error
rate. Now consider the worst possible case of input errors,
when every change which arrives at the data base is an error.
Then if at time t and thereafter every entry to the data base
is an error, the number of errors in the data base will increase
at its maximum rate. The upper bound on the fraction of errors
present at time t, say p(t), is shown in Fig. 9 as the solid
diagonal line.

We draw confidence limits about p(t) just as we do about
the steady state lines. From Fig. 9 it can be seen that unless
the lower confidence bound on p(t) exceeds the upper confidence
bound on p(0), we will probably not detect a change in the
error rate.

Figure 8. Change in Average Error and Normal Densities.

The equation of p(t) is known since the average number of
changes per day is known, and since all of these changes are
assumed to be in error. We number the time scale so that the
diagonal starts at t=0.

\[ p(t) = p(0) + \mu\bigl(1-p(0)\bigr)t; \qquad 0 \le t \le 1/\mu, \tag{14} \]

where $1/\mu$ is the time an error spends in the data base
(assumed constant to obtain the upper bound in Fig. 9). Let
2a(t) be the width of the 95% confidence interval at t. Then
for a sample of size n,

\[ a(t) = 1.96\,\hat{\sigma}_{p(t)} = 1.96\sqrt{\frac{p(t)\bigl(1-p(t)\bigr)}{n}}, \qquad t \ge 0. \tag{15} \]


Let F(t) be the lower confidence limit at t. Then

\[ F(t) = p(t) - a(t) = p(t) - 1.96\sqrt{\frac{p(t)\bigl(1-p(t)\bigr)}{n}}. \tag{16} \]

The upper confidence limit at t=0 is

\[ p(0) + a(0) = p(0) + 1.96\sqrt{\frac{p(0)\bigl(1-p(0)\bigr)}{n}}. \tag{17} \]

We take the minimum time between samples to be that t for
which (16) and (17) are equal. That is, $t_s$, the time between
samples, satisfies

\[ p(t_s) - 1.96\sqrt{\frac{p(t_s)\bigl(1-p(t_s)\bigr)}{n}} = p(0) + 1.96\sqrt{\frac{p(0)\bigl(1-p(0)\bigr)}{n}}. \tag{18} \]


Figure 9. Confidence Interval for an Increasing Error Rate.

For example, consider the case where p(0)=.1, a sample size
n of 1000, a confidence interval of 95%, and a time in the
data base of 300 days. Then

\[ p(t) = .1, \qquad t < 0 \]

and

\[ p(t) = .1 + .003t, \qquad 0 \le t \le 300. \]

Also

\[ \hat{\sigma}_{p(0)}^2 = \frac{(.1)(.9)}{1000} = .00009. \]

Thus

a(0) = .019 and p(0) = .1.

From Eq. (18) we find that

\[ t_s = 14 \text{ days}. \]

That is, the minimum sampling period to consider is two weeks.
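
Equation (18) has no convenient closed-form solution, but it is easily solved numerically. The sketch below uses bisection, which is our choice of method and not one stated in the thesis; the function names are hypothetical, and the variance is taken as p(1-p)/n.

```python
import math

def half_width(p, n, z=1.96):
    """Half width a of the confidence interval about an error fraction p."""
    return z * math.sqrt(p * (1.0 - p) / n)

def min_time_between_samples(p0, n, time_in_base, z=1.96, tol=1e-6):
    """Solve Eq. (18) for t by bisection on the interval [0, time_in_base]."""
    mu = 1.0 / time_in_base
    upper_at_zero = p0 + half_width(p0, n, z)            # Eq. (17)

    def lower_limit(t):                                   # Eq. (16) with Eq. (14)
        p_t = min(1.0, p0 + mu * (1.0 - p0) * t)          # clamp against round-off
        return p_t - half_width(p_t, n, z)

    lo, hi = 0.0, time_in_base
    if lower_limit(hi) < upper_at_zero:
        raise ValueError("limits never cross within the time in the data base")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lower_limit(mid) < upper_at_zero:
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    t = min_time_between_samples(p0=0.1, n=1000, time_in_base=300.0)
    print(t, math.ceil(t))
```

For the example in the text (p(0) = .1, n = 1000, 300 days in the data base) the crossing falls between 13 and 14 days, so rounding up to whole days reproduces the two-week figure.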